

Search results: All records where Creators/Authors contains "Shah, Ankit"

Note: Clicking a Digital Object Identifier (DOI) link takes you to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the publisher's embargo period.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Free, publicly-accessible full text available May 15, 2026
  2. Free, publicly-accessible full text available April 24, 2026
  3. Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less." 
    Free, publicly-accessible full text available April 24, 2026
  4. Free, publicly-accessible full text available April 24, 2026
  5. Free, publicly-accessible full text available April 24, 2026
  6. Free, publicly-accessible full text available January 1, 2026
  7. Free, publicly-accessible full text available January 1, 2026
  8. Cybersecurity operations centers (CSOCs) protect organizations by monitoring network traffic and detecting suspicious activities in the form of alerts. The security response team within CSOCs is responsible for investigating and mitigating alerts. However, an imbalance between alert volume and available analysts creates a backlog, putting the network at risk of exploitation. Recent research has focused on improving the alert-management process by triaging alerts, optimizing analyst scheduling, and reducing analyst workload through systematic discarding of alerts. However, these works overlook the delays caused in alert investigations by several factors, including: (i) false or benign alerts contributing to the backlog; (ii) analysts experiencing cognitive burden from repeatedly reviewing unrelated alerts; and (iii) analysts being assigned alerts that do not match their expertise. We propose a novel framework that considers these factors and uses machine learning and mathematical optimization methods to dynamically improve throughput during work shifts. The framework achieves efficiency by automating the identification and removal of a portion of benign alerts, forming clusters of similar alerts, and assigning analysts to alerts with matching attributes. Experiments conducted using real-world CSOC data demonstrate a 60.16% reduction in the alert backlog for an 8-hour work shift compared to the currently employed approach.
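The diversity-promoting selection described in the DS2 abstract (item 3) can be illustrated with a small greedy sketch. This is only an illustrative MMR-style heuristic under assumed names (`select_diverse`, trade-off weight `lam`, Euclidean feature distance), not the paper's actual DS2 procedure, which additionally corrects LLM-assigned scores through a score transition matrix before selection:

```python
def select_diverse(scores, features, k, lam=0.7):
    """Greedily pick k samples, trading off each sample's quality score
    against its distance to the already-selected set, so that redundant
    near-duplicates are penalized (an MMR-style illustrative heuristic)."""
    def dist(a, b):
        # Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = []
    remaining = list(range(len(scores)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: lam * scores[i]
            + (1 - lam)
            * min((dist(features[i], features[j]) for j in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With this trade-off, a low-scoring but distinctive sample can be chosen over a high-scoring near-duplicate of an already-selected one, which is the "redundant samples degrade performance" intuition the abstract highlights.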
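The expertise-matched assignment step in the CSOC abstract (item 8) can be sketched as a simple greedy matching. The function name `assign_alerts` and the expertise-score dictionary are hypothetical stand-ins; the paper formulates this as a mathematical optimization model rather than the greedy pass shown here:

```python
def assign_alerts(analysts, alerts, expertise):
    """Greedily match each analyst to at most one alert, highest
    expertise score first. `expertise[(analyst, alert_type)]` is an
    assumed match score in [0, 1]; missing pairs default to 0."""
    # Enumerate all (score, analyst, alert index) pairs, best first.
    pairs = sorted(
        ((expertise.get((a, t), 0.0), a, i)
         for a in analysts
         for i, t in enumerate(alerts)),
        reverse=True,
    )
    taken_analysts, taken_alerts, assignment = set(), set(), {}
    for score, a, i in pairs:
        if a not in taken_analysts and i not in taken_alerts:
            assignment[i] = a  # alert i handled by analyst a
            taken_analysts.add(a)
            taken_alerts.add(i)
    return assignment
```

A greedy pass like this is not guaranteed optimal; an assignment-problem solver (e.g. the Hungarian algorithm) would maximize the total match score exactly, which is closer in spirit to the optimization approach the abstract describes.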